{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用 k 近邻算法改进网站的配对效果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**说明：**\n",
    "\n",
    "将数据集文件 'datingTestSet2.txt' 放在当前文件夹"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入程序所需要的模块\n",
    "import numpy as np\n",
    "import operator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 定义数据集导入函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "file2matrix函数实现的功能是读取文件数据，函数返回的returnMat和classLabelVector分别是数据集的特征矩阵和输出标签向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def file2matrix(filename):\n",
    "    love_dictionary = {'largeDoses':3, 'smallDoses':2, 'didntLike':1}    # 三个类别\n",
    "    fr = open(filename)    # 打开文件\n",
    "    arrayOLines = fr.readlines()    # 逐行打开\n",
    "    numberOfLines = len(arrayOLines)            #得到文件的行数\n",
    "    returnMat = np.zeros((numberOfLines, 3))        #初始化特征矩阵\n",
    "    classLabelVector = []                       #初始化输出标签向量\n",
    "    index = 0\n",
    "    for line in arrayOLines:\n",
    "        line = line.strip()    # 删去字符串首部尾部空字符\n",
    "        listFromLine = line.split('\\t')    # 按'\\t'对字符串进行分割，listFromLine 是列表\n",
    "        returnMat[index, :] = listFromLine[0:3]    # listFromLine的0,1,2元素是特征，赋值给returnMat的当前行\n",
    "        if(listFromLine[-1].isdigit()):    # 如果listFromLine最后一个元素是数字\n",
    "            classLabelVector.append(int(listFromLine[-1]))    # 直接赋值给classLabelVector\n",
    "        else:    # 如果listFromLine最后一个元素不是数字，而是字符串\n",
    "            classLabelVector.append(love_dictionary.get(listFromLine[-1]))    # 根据字典love_dictionary转化为数字\n",
    "        index += 1\n",
    "    return returnMat, classLabelVector    # 返回的类别标签classLabelVector是1,2,3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 定义特征归一化函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**这里为什么要对特征进行归一化？**\n",
    "\n",
    "因为在处理这种不同取值范围的特征值时，数值归一化能够将不同特征的取值范围限定在同一区间例如[0,1]之间，让不同特征对距离的计算影响相同。具体可看《机器学习实战》第2.2.3节内容。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def autoNorm(dataSet):\n",
    "    minVals = dataSet.min(0)\n",
    "    maxVals = dataSet.max(0)\n",
    "    ranges = maxVals - minVals\n",
    "    normDataSet = np.zeros(np.shape(dataSet))\n",
    "    m = dataSet.shape[0]\n",
    "    normDataSet = dataSet - np.tile(minVals, (m, 1))\n",
    "    normDataSet = normDataSet/np.tile(ranges, (m, 1))   # normDataSet值被限定在[0,1]之间\n",
    "    return normDataSet, ranges, minVals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 定义 k 近邻算法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def classify0(inX, dataSet, labels, k):    # inX是测试集，dataSet是训练集，lebels是训练样本标签，k是取的最近邻个数\n",
    "    dataSetSize = dataSet.shape[0]    # 训练样本个数\n",
    "    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet    # np.tile: 重复n次\n",
    "    sqDiffMat = diffMat**2\n",
    "    sqDistances = sqDiffMat.sum(axis=1)\n",
    "    distances = sqDistances**0.5    # distance是inX与dataSet的欧氏距离\n",
    "    sortedDistIndicies = distances.argsort()    # 返回排序从小到达的索引位置\n",
    "    classCount = {}   # 字典存储k近邻不同label出现的次数\n",
    "    for i in range(k):\n",
    "        voteIlabel = labels[sortedDistIndicies[i]]\n",
    "        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1    # 对应label加1，classCount中若无此key，则默认为0\n",
    "    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)    # operator.itemgetter 获取对象的哪个维度的数据\n",
    "    return sortedClassCount[0][0]    # 返回k近邻中所属类别最多的哪一类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 测试算法：作为完整程序验证分类器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def datingClassTest():\n",
    "    hoRatio = 0.10      #整个数据集的10%用来测试\n",
    "    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')       #导入数据集\n",
    "    normMat, ranges, minVals = autoNorm(datingDataMat)    # 所有特征归一化\n",
    "    m = normMat.shape[0]    # 样本个数\n",
    "    numTestVecs = int(m*hoRatio)    # 测试样本个数\n",
    "    errorCount = 0.0\n",
    "    for i in range(numTestVecs):\n",
    "        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)\n",
    "        print(\"the classifier came back with: %d, the real answer is: %d\" % (classifierResult, datingLabels[i]))\n",
    "        if (classifierResult != datingLabels[i]): errorCount += 1.0\n",
    "    print(\"the total error rate is: %f\" % (errorCount / float(numTestVecs)))    # 打印错误率\n",
    "    print(errorCount)    # 打印错误个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 3\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 3, the real answer is: 3\n",
      "the classifier came back with: 2, the real answer is: 2\n",
      "the classifier came back with: 1, the real answer is: 1\n",
      "the classifier came back with: 3, the real answer is: 1\n",
      "the total error rate is: 0.050000\n",
      "5.0\n"
     ]
    }
   ],
   "source": [
    "datingClassTest()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用算法：构建完整可用系统"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据用户输入，在线判断类别。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def classifyPerson():\n",
    "    resultList = ['not at all', 'in small doses', 'in large doses']\n",
    "    percentTats = float(input(\\\n",
    "                                  \"percentage of time spent playing video games?\"))\n",
    "    ffMiles = float(input(\"frequent flier miles earned per year?\"))\n",
    "    iceCream = float(input(\"liters of ice cream consumed per year?\"))\n",
    "    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')\n",
    "    normMat, ranges, minVals = autoNorm(datingDataMat)\n",
    "    inArr = np.array([ffMiles, percentTats, iceCream, ])\n",
    "    classifierResult = classify0((inArr - \\\n",
    "                                  minVals)/ranges, normMat, datingLabels, 3)\n",
    "    print(\"You will probably like this person: %s\" % resultList[classifierResult - 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "percentage of time spent playing video games?10\n",
      "frequent flier miles earned per year?5000\n",
      "liters of ice cream consumed per year?20\n",
      "You will probably like this person: in small doses\n"
     ]
    }
   ],
   "source": [
    "classifyPerson()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
